Process Migration and Fault Tolerance of BSPlib Programs Running on Networks of Workstations
Authors
Abstract
This paper describes a system that enables parallel programs written using the BSPlib communications library to migrate processes among the machines in a network of workstations. The system not only provides fault tolerance for BSPlib jobs but also uses a load manager, which maintains an approximation of the global load of the system, to continually schedule the migration of BSP processes onto the least loaded machines in the network. Results for an industrial electromagnetics application show that a publicly available collection of workstations can achieve throughput similar to that of a dedicated network of workstations (NOW).
Similar resources
Improving the Performance of Coordinated Checkpointers on Networks of Workstations using RAID Techniques
Coordinated checkpointing systems are popular, general-purpose tools for implementing process migration, coarse-grained job swapping, and fault tolerance on networks of workstations. Though they are simple in concept, several design decisions concerning the placement of checkpoint files can affect the performance and functionality of coordinated checkpointers. Although several such che...
Analysing an SQL Application with a BSPlib Call-Graph Profiling Tool
This paper illustrates the use of a post-mortem call-graph profiling tool in the analysis of an SQL query processing application written using BSPlib [4]. Unlike other parallel profiling tools, it uses an architecture-independent metric, the imbalance in the size of communicated data, to guide program optimisation. We show that by using this metric, BSPlib programs can be optimised in a portable and ...
Transparent Fault Tolerance for Parallel Applications on Networks of Workstations
This paper describes a new method for providing transparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of a shared object system called SAM, a portable run-time system that provides a global name space and automatic caching of shared data. SAM incorporates a novel design intended to address the problem of the high communicat...
Mechanism for Implementation of Load Balancing using Process Migration
Load sharing, or load balancing, involves migrating running processes from heavily loaded workstations in a network to lightly loaded or idle ones. This paper describes load balancing techniques for sharing the workload among the workstations of a particular network in order to obtain better performance from the network as a whole. The mechanisms of load informat...
Executing multithreaded programs efficiently
This thesis presents the theory, design, and implementation of Cilk (pronounced “silk”) and Cilk-NOW. Cilk is a C-based language and portable runtime system for programming and executing multithreaded parallel programs. Cilk-NOW is an implementation of the Cilk runtime system that transparently manages resources for parallel programs running on a network of workstations. Cilk is built around a ...